model icon

scgpt

Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

scGPT

This is the official codebase for scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI.

Preprint

Documentation

!UPDATE: We have released several new pretrained scGPT checkpoints. Please see the Pretrained scGPT checkpoints section for more details.

Installation

scGPT is available on PyPI. To install scGPT, run the following command:

$ pip install scgpt

[Optional] We recommend using wandb for logging and visualization.

$ pip install wandb

For developing, we are using the Poetry package manager. To install Poetry, follow the instructions here.

$ git clone this-repo-url
$ cd scGPT
$ poetry install

Note: The flash-attn dependency usually requires specific GPU and CUDA version. If you encounter any issues, please refer to the flash-attn repository for installation instructions. For now, May 2023, we recommend using CUDA 11.7 and flash-attn<1.0.5 due to various issues reported about installing new versions of flash-attn.

Pretrained scGPT Model Zoo

Here is the list of pretrained models. Please find the links for downloading the checkpoint folders. We recommend using the whole-human model for most applications by default. If your fine-tuning dataset shares similar cell type context with the training data of the organ-specific models, these models can usually demonstrate competitive performance as well.

Model name Description Download
whole-human (recommended) Pretrained on 33 million normal human cells. link
brain Pretrained on 13.2 million brain cells. link
blood Pretrained on 10.3 million blood and bone marrow cells. link
heart Pretrained on 1.8 million heart cells link
lung Pretrained on 2.1 million lung cells link
kidney Pretrained on 814 thousand kidney cells link
pan-cancer Pretrained on 5.7 million cells of various cancer types link

Fine-tune scGPT for scRNA-seq integration

Please see our example code in examples/finetune_integration.py. By default, the script assumes the scGPT checkpoint folder stored in the examples/save directory.

To-do-list

  • Upload the pretrained model checkpoint
  • Publish to pypi
  • Provide the pretraining code with generative attention masking
  • Finetuning examples for multi-omics integration, cell type annotation, perturbation prediction, cell generation
  • Example code for Gene Regulatory Network analysis
  • Documentation website with readthedocs
  • Bump up to pytorch 2.0
  • New pretraining on larger datasets
  • Reference mapping example
  • Publish to huggingface model hub

Contributing

We greatly welcome contributions to scGPT. Please submit a pull request if you have any ideas or bug fixes. We also welcome any issues you encounter while using scGPT.

Acknowledgements

We sincerely thank the authors of following open-source projects:

Citing scGPT

@article{cui2023scGPT,
title={scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI},
author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and Pang, Kuan and Luo, Fengning and Wang, Bo},
journal={bioRxiv},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}

Related notebook

BioTuring

Spatial charting of single-cell transcriptomes in tissues - celltrek

Single-cell RNA sequencing methods can profile the transcriptomes of single cells but cannot preserve spatial information. Conversely, spatial transcriptomics assays can profile spatial regions in tissue sections but do not have single-cell resolution. Here, Runmin Wei (Siyuan He, Shanshan Bai, Emi Sei, Min Hu, Alastair Thompson, Ken Chen, Savitri Krishnamurthy & Nicholas E. Navin) developed a computational method called CellTrek that combines these two datasets to achieve single-cell spatial mapping through coembedding and metric learning approaches. They benchmarked CellTrek using simulation and in situ hybridization datasets, which demonstrated its accuracy and robustness. They then applied CellTrek to existing mouse brain and kidney datasets and showed that CellTrek can detect topological patterns of different cell types and cell states. They performed single-cell RNA sequencing and spatial transcriptomics experiments on two ductal carcinoma in situ tissues and applied CellTrek to identify tumor subclones that were restricted to different ducts, and specific T-cell states adjacent to the tumor areas.

T-cells

More

BioTuring

Required GPU

scVI-tools: single-cell variational inference tools

scVI-tools (single-cell variational inference tools) is a package for end-to-end analysis of single-cell omics data primarily developed and maintained by the Yosef Lab at UC Berkeley. scvi-tools has two components - Interface for easy use of a range of probabilistic models for single-cell omics (e.g., scVI, scANVI, totalVI). - Tools to build new probabilistic models, which are powered by PyTorch, PyTorch Lightning, and Pyro.

Ionocytes

Club cells

More

BioTuring

Hierarchicell: estimating power for tests of differential expression with single-cell data

Power analyses are considered important factors in designing high-quality experiments. However, such analyses remain a challenge in single-cell RNA-seq studies due to the presence of hierarchical structure within the data (Zimmerman et al., 2021). As cells sampled from the same individual share genetic and environmental backgrounds, these cells are more correlated than cells sampled from different individuals. Currently, most power analyses and hypothesis tests (e.g., differential expression) in scRNA-seq data treat cells as if they were independent, thus ignoring the intra-sample correlation, which could lead to incorrect inferences. Hierarchicell (Zimmerman, K.D. and Langefeld, C.D., 2021) is an R package proposed to estimate power for testing hypotheses of differential expression in scRNA-seq data while considering the hierarchical correlation structure that exists in the data. The method offers four important categories of functions: data loading and cleaning, empirical estimation of distributions, simulating expression data, and computing type 1 error or power. In this notebook, we will illustrate an example workflow of Hierarchicell. The notebook is inspired by Hierarchicell's vignette and modified to demonstrate how the tool works on BioTuring's platform.

Ionocytes

Respiratory ciliated cells

More